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2 COMPUTATION AND ALGORITHMIC DIFFERENTIATION 



1 Introduction 

This paper is concerned with the efficient evaluation of higher-order derivatives of functions / that are com- 
posed of matrix operations. I.e., we want to compute the D-th derivative tensor 

V D f(X) e R N ° , 

where / : M. N — > E is given as an algorithm that consists of many matrix operations. We propose a method 
that is a combination of two well-known techniques from Algorithmic Differentiation (AD): univariate Taylor 
propagation on scalars (UTPS) [GW08, GWU98] and first-order forward and reverse on matrices [Gil08J. 
The combination leads to a technique that we would like to call univariate Taylor propagation on matrices 
(UTPM). The method inherits many desirable properties: It is easy to implement, it is very efficient and it 
returns not only V D f but yields in the process also the derivatives V f for d < D. As performance test we 
compute the gradient V/(X) of f(X) — tr(Z _1 ) in the reverse mode of AD for X E M. nxn . We observe a 
speedup of about 100 compared to UTPS. Due to the nature of the method, the memory footprint is also small 
and therefore can be used to differentiate functions that are not accessible by standard methods due to limited 
physical memory. 

The following sections are structured as follows: In Sect. [2] we give a brief explanation of the key ideas of 
AD. In Sect. |3] we give a summary of UTPS which is then used in Sect. @] where the forward and reverse 
mode of AD are explained. In Sect. [5] we show how the forward and reverse mode can be combined to 
compute higher order derivatives. Sect. [6] serves as motivation for UTPM. In Sect. [7] the central idea of 
UTPM is introduced. Section [8] shows how this idea is applied to the reverse mode of AD followed by Section 
[9]where the combination of forward and reverse mode on matrices is explained. In Sect. [K)]we briefly discuss 
the complexity of UTPM compared to UTPS and in Sect. \TT\ we show how our proposed method performs 
compared to existing state-of-the-art methods in practice. 



2 Computation and Algorithmic Differentiation 

A program is a sequence of instructions that a computer can interpret step by step. Generally, functions 
of practical interest in science and engineering can be evaluated as a program. Mathematically speaking, 
such functions are composite functions of elementary functions. The definition of elementary is not strict. 
In fact, only the four operators +, — , x, / are really elementary: they are required to define the field of real 
numbers JR.. The theory of Algorithmic Differentiation (AD) is the application of the chainrule to the sequence 
of elementary functions. In the context of AD, we mean by elementary functions such functions that have 
"nice" analytical properties, i.e. can be differentiated analytically. In formulas, we want to evaluate functions 
F : M. N — > R M that are built of elementary functions <j>: 

F : x i — ► y = F[x) , 
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3 UNIVARIATE TAYLOR PROPAGATION ON SCALARS 



where x= (xi, . . . ,x/v), y = (ji, . . . ,)>m)- If Af = 1 we use /instead of F. For example /(x 1,^2) = ;q *X2 +*i 
can be written as 

f(x h x 2 ) = ^3(^1(^1,^2), ^2(^1)) = 03(vi,v 2 ). 
We use the notation v/ for the result of (j)[ and vj^i for all arguments of 0/. To be consistent, the independent 
input arguments x n are also written as v n -u = x n . To sum it up, the following three equations describe the 
function evaluation: 

v n -N = *n n = l,...,N (2a) 

Vi = 0/(vy^) Z=1,...,L (2b) 
yjf-m = v L - m m = M-\,...,0, (2c) 

where L is the number of calls to elementary functions (j>i during the computation of F (L = 3 in the above 
example). Running indices (n,m, I) use the same letter as the boundary values (N,M,L) to make the notation 
easier to read. The sequence can also be represented by a computational graph, as depicted in Fig. [Q 

X = X*Y 

X = X.dot(Y) + X. transpose () 
X = Y + X * Y 

Y = X.invO 

Y = Y. transpose () 
Z = X * Y 
TR = Z. trace () 

eg. independentFunctionList = [X, Y] 
cg.dependentFunctionList = [TR] 

Figure 1 : The computational graph on the left side is defined by the computer program on the right side. The 
variables X and Y are matrices. The squares represent function nodes. The numbers represent the occurrence 
in the sequence of successive operations. Independent variables are represented as circles. 

To differentiate such a program given as sequence of elementary functions 0/ the chain rule is applied to each 
elementary function 0/: 

#/(Vj^z) = 

]-<l J 

That means that the evaluation of the derivative of F breaks down to differentiating the elementary functions 
0/. In contrast to symbolic differentiation, rather than the symbolic expression, the numerical value is propa- 
gated. In the following sections we will concentrate on one elementary function f = and keep in mind that 
we have then treated arbitrary functions that are composed of such elementary functions. 

3 Univariate Taylor Propagation on Scalars 

In this section it is explained how higher order derivatives of functions F : M. N — > K M can be computed by 
means of Univariate Taylor Propagation on Scalars (UTPS). This theory has been successfully implemented 
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4 THE FORWARD AND REVERSE MODE ON SCALARS 



in software by use of operator overloading, for example ADOL-C HGJU:96H or CppAD [Be09]. The key 
observation is that the propagation of a univariate truncated Taylor polynomial xo + tE To of degree D 
through a function / : R — > R yields the derivatives d d f, < d < D: 



D 



1 



f(xo + t) = Z-d d f(xo)t d + 0(t D+l ). 



d=0 



In the application of the chain rule to the elementary functions the Taylor coefficients v l d are filled with 
non-zero entries. In general, UTPS is given by 



t ydt d = f(t x d fC 

d=0 d=Q 



D i d d D 
d=0 " ■ u ' c=0 



t d + . 



t=0 



The explicit formulas of yd for d = 0, . . . ,D have to be calculated analytically. For some simple functions 
explicit expressions can be obtained: See TableQ]for some examples. To ease the notation we sometimes use 
[•] when we mean a univariate Taylor polynomial. E.g. [x] := T,d=o x d t ■ 



(j){u,v) 


d = 0,...,D 


u + cv 
uxv 

u/v 


<l>d = Ud+CVd 
$d=I?j=0UjV d -j 

fe = 4 u d-Ljz l <i>jVd-j 



Table 1: UTPS of the binary functions G {+, x, /}. This table summarizes how the Taylor coefficients ^ 
in Ld=o<l>dt d = <l>(Ld=o u d td \Ld=o v dt d ) are computed. 



4 The Forward and Reverse Mode on Scalars 

To compute first order derivatives, it is favorable to use the reverse mode of AD when M < N. However, this 
rule is only valid when a algorithmic complexity model is used that discards memory movements. In practice, 
memory movements are not only a minor correction to the actual runtime on a computer, but in fact a major 
contributor. The rule of thumb is therefore: if M < 5N then the reverse mode is most likely favorable. 

The forward mode propagates directional derivatives. I.e. it applies the chain rule starting at 0o- This can 
easily be done with UTPS as explained in the previous section. The reverse mode computes derivative vectors 
by applying the chain rule starting at i.e. compute 




The recursion continues by applying the chainrule to dx n = dx n (z). The recursion is stopped if x n is an 
independent variable. 
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5 COMBINING FORWARD AND REVERSE MODE 



Example We want to compute the function f(g(x),y) = g(x)y = x 2 y. In the forward mode we compute the 
directional derivative i£(x,y) = [l,0]V/(jc,y) : 

[x] = [x,l];[y] = [y,0] 
g ([ x ]) = [x] 2 = [x,l][x,l] = [x\2x] 

f{\g\M = [g][y] = [x 2 M[yM = [x 2 yaxy] 

where f\ = 2xy is the wanted directional derivative. 
In the reverse mode we compute: 



df(g,y) = §^(z>jO 



dg + ^-dy 

= y dg+ g dy 



=■8 y 
= glxdx + ydy, 

=:x 

where the gradient of f(x,y) and can be read from x and y: 

5 Combining Forward and Reverse Mode 

To compute higher-order derivatives efficiently, one can combine forward and reverse mode. The important 
observation is that one can differentiate functions F : — ► that propagate univariate Taylor polynomials 
in the forward mode once more in the reverse mode. In consequence one obtain obtains derivatives of degree 
D + 1 . The combination relies on the interchangeability of the differential operators d and ^ : 



d=0 " • " l c=0 



f=0 d =° =:G 



i.e. to compute one higher order of derivatives with the reverse mode one can symbolically differentiate F to 
obtain G = dF and then use UTPS on G. That we obtain one higher order of derivatives can be seen from 
Eqn. ©. 

Example The goal is to compute the Hessian-vector product H ■ v at x = (2,3,7) r with v = (1,0, 0) r . The 
Hessian is defined by the function 



f-.m 3 



x ^ y = f(x) = xix 2 X3 
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5 COMBINING FORWARD AND REVERSE MODE 



and reads 

/ x 3 x 2 \ 
H=lx 3 xi . 

\X 2 X\ / 

I.e., we want to compute the first column (0,X3,X2) = (0,7, 3) of the Hessian. 



[V-2] : 


= N 


= [2,1] 




[v-i] = 


= N 


= [3,0] 




[vo] 


= N 


= [7,0] 




[vi] = 


= [V-2][V-l] 


= [2,1] [3,0] = 


[6,3] 




= [vi] [vo] 


= [6, 3] [7,0] = 


[42,21] 


fa] 


= bl 


= [i,o] 




[vi] : 


= [V2] [vo] 


= [1,0][7,0] = 


[7,0] 


[vo] 


= N[vi] 


= [1,0][6,3] = 


[6,3] 


[v-i] : 


= [Vl] [V-2] 


= [7,0] [2,1] = 


[14,7] 


[ v -2] : 


= [vi][v-i] 


= [7,0] [3,0] = 


[21,0] 


[X] : 


= [V-2] 


= [21,0] 




M ■ 


= [v-i] 


= [14,7] 




M ■ 


= [vo] 


= [6,3] 





The brackets [x\] denote truncated Taylor series. The purpose of the first three lines is solely to make the 
notation consistent. The next two lines are the FDE where the multiplication between truncated Taylor series 
as explained in Tabled] has been used. Then the first adjoint variable [y] is defined. From there, the adjoints 
are computed in reverse order. Finally, in the last three lines the adjoint variables are renamed. The first 
Taylor coefficient of x,y,z are the first column of H, i.e. (//n,//2i,#3i) = (xi,y~i,zi) 

We obtain 

d[f] = dsin([y]) 

= cos([y])d[y] 

= [cos(v ),-sin(y )yi]d[v] 

S v ' 

=m 

= [y]dexp([x]) 

= [y]exp([x])d[x] 

= [y][exp(x ),exp(x )xi]d[x] 

= [cos(yo)exp(xo),cos(yo)exp(x )xi -sin(y )yiexp(x )]d[x] 

= [x]d[x] , 

where we find that [jc] = 
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7 UNIVARIATE TAYLOR PROPAGATION ON MATRICES 



6 Algorithmic Differentiation on Matrices 

The theory of matrix differential calculus is well-known in the statistics and econometrics community and 
there are a number of textbooks and papers available, e.g. [MN99, MinOOJ and references therein. Our work 
is based on the tutorial paper Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic 
Differentiation by M. Giles [Gil08|. The need for higher order derivatives of matrix operations arises for 
example in optimal experimental design (OED) problems. The OED objective function 4> is a function that 
depends on the covariance matrix C £ M. n p xN p of the parameters p £ ~R n p. The covariance matrix C is itself 
a complicated expression in 7 = (J\,J2), where 7i £ M, n m* n p [ s the sensitivity of the measurement model 
functions and J2 £ M. NcxN p the sensitivity of the constraint functions w.r.t. the parameters p. In particular, the 
following NLP has to be solved w.r.t. the control variables q: 

<1* = argmin ?£5cR ^4>(C(7(^))) , 

--(-)(t'iHts)(f fHo)- 

Typical NLP solvers need at least gradients of the objective function <f> and one therefore has to differentiate 
the above matrix operations. In robust settings the objective function often requires higher order derivatives 
of matrix operations. If there are no constraints in the parameter estimation, the above expression simplifies 
to 

4>(C) = tr(C)=tr((7 r 7) _1 ) . (8a) 

The sequence of operations needed in the reverse mode of AD for Eqn. © is shown in Table [2] This also 
motivates the test function in Sect. Qj] which is part of the sequence in Tabled 



7 Univariate Taylor Propagation on Matrices 



There are two possibilities how to differentiate matrix operations: Either one regards matrices as two-dimensional 
arrays and differentiates the linear algebra algorithms, or one considers matrices as elementary objects and 
applies matrix calculus. Using UTPS on the first possibility results in the following formal procedure: 



Ll 



[Yn y \ 



[YlMy] 



[Yn y m y ] 



( 



\ 



ll 



[ X 1M X ] 



\X, 



AfcU 



P0v x M x ]J / 



where Af is the number of rows and M the number of columns. A simple reformulation transforms a matrix of 
truncated Taylor polynomials into a truncated Taylor polynomial of matrices: 



ll+d 



IM+d' 



vNl+d 
d=0 A d 1 



NM f d 
d 1 . 



D 

E 

d=0 



A d 



A d 



vNM 
d 



(10a) 
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9 HIGHER ORDER MATRIX DERIVATIVES 



We denote from now on the rhs of Eqn. (flOl) as [X], The formal procudure then reads 

[Y]=F([X]) + ff(t D+1 ), 

which can be treated with matrix calculus. We'd like to call this approach Univariate Taylor Propagation on 
Matrices (UTPM). Notice that even square matrices only form a noncommutative ring. 



8 Reverse Mode on Matrices 



Applying the reverse mode to an objective function with matrix argument yields 



4> d<N Y 



_ d<i> 



JiVxM 



BY 



(11a) 



tr 



\ 



r d<S> 


-| 


dY n ■ 


' dY w 


d<i> 


<3<J> 


ldY Ml ■ 


dYMN-l 



dY m 



dY lM 
dY NM 



\ 



:Y T eR MxN 



(lib) 



/ 



= tr(? r dy). (11c) 

From that point, one has to successively go backward and find the dependency w.r.t. the arguments X of 
Y = Y(X). The reverse mode for the inverse of a matrix Y = X~ l , transpose of a matrix Y =X T , trace of a 
matrix y = tx(X) and the matrix matrix multiplication Z = XY are given by [Gil08J: 

1 : tr(f r dy) =tr(-YY T YdX) 

=:X T 

tr(? r dy) =tr(^^dX) 

y = tr(X) : jdtr(X) =tr(jIdZ) 

Z = XY : tr(Z r dZ) =tr (YZ T dX + Z T XdY) . 



Y=X 



Y =X T : 



9 Higher Order Matrix Derivatives 

To compute higher order derivatives d D 4> one can apply UTPM and then use the reverse mode as shown in 
the previous section. In formulas 

[4> r ]d4»([X]) = tr([l r ]d[Z]), 



where we have defined [X] and [dX] as Taylor polynomials of matrices as introduced in Sect. [7J 
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1 ALGORITHMIC COMPLEXITY OF UTPS VS UTPM 



vo 


= y 


v 4 




4> 


Vl 


= vj 


V3 


+= 


V4I 


v 2 


= V1-V2 


V2 


+= 


r - r 


V3 


= (V2)- 1 


Vl 


+= 


v 2 v£ 


V 4 


= tr(v 3 ) 


vo 


+= 


vl v 2 






vo 


+= 


-T 



Table 2: This table shows how the gradient of Eqn. ([8]) is computed in the reverse mode of AD. The left side 
is the function evaluation. All temporary results v/ are saved in memory. They are required in the reverse 
sweep that is shown on the right side. The operations needed in the reverse sweep are defined in Eqn. dMI]). 
The final derivative can be read from vo = V4>. 

Example: Forward UTPM for the Matrix Inversion We want to compute [X]~ l , where the constant term 
X e R NxN is regular. I.e. we have to find [Y] = [X}~ 1 s.t. 

1 = mm = {t x A [t Y A = t (t^d-e) t d +0{t D+l ) . 

\d=0 J \e=0 J d=0 \e=0 / 

The Taylor coefficients can now be computed recursively: 

: X Y 
1 : X Yi +X l Y 

2: X Y2+XiY l +X 2 Yo 

^Y 2 

d 

d : J! X e Yd_ e 

e=Q 



One can see that the inversion has only to be performed once. If D was large, techniques as used in the fast 
Fourier transform could be applied. However, typically D < 4. 

10 Algorithmic Complexity of UTPS vs UTPM 

In the literature polynomial matrix computations have been thoroughly treated, c.f. e.g. HGJV031ICK91H and 
references therein. These publications put more focus on the algebraic complexity theory, are only suitable 
for large degree D or use an unsuitable complexity measure for our purposes. Here we keep things simple to 
highlight the difference to the traditional approach in AD theory. 



= 1 ^Y =X Q l 
= 

= — X Q X{Yq 
= 

= -X l +X 2 Y ) 
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1 ALGORITHMIC COMPLEXITY OF UTPS VS UTPM 



In the theory of AD one traditionally differentiates the algorithms of matrix operations to compute derivatives. 
For naive implementations of the matrix addition and multiplication the approach of UTPS is equivalent to 
UTPM when complexity measures neglecting memory movements are used. More sophisticated algorithms, 
e.g. the matrix inversion, result in algorithms that are significantly different in the complexity. The computa- 
tional cost OPS to compute the whole Taylor series of the matrix inversion is 

D 

OPSQX]- 1 ) = OPS(-X _1 ) + £((</+l)OPS(Afl) + (d-l)OPS(A+fl)) 

d=i 

= OPS(-Z~ 1 ) + (D + 3 ) D QPS(A£) + (D ~ 1)D OPS(A + £) . 



The matrix addition is ff{N 2 ) and the matrix multiplication is ff(N 3 ). We therefore have a computational cost 
that scales as <ff(D 2 N 3 ). 

Differentiating an algorithm that inverts a matrix requires overloading of the scalar multiplication and addi- 
tion. The multiplication of two Taylor polynomials needs 

D d D 

£ OPS(J> e y rf _ e ) = £ JOPS(x + y) + (J+l)OPS(*y) 

d=0 e=0 d=0 

- ' D+ " D ops(, +y ) + ' D+2 '< D+ "ops to ) 



2 v J ' 2 

operations and the addition £g=o 0FS ( x d +yd) = (D+1) OPS(jc + y). I.e., UTPS needs 

OPSQX-i]) = OPS(* ; X-i) OPS(, + y) + (D + 2 f +1) QPSQcy) 

+ OPS(+,X- 1 ) ((D+ 1) OPS(jc + y)) 

operations in total. The quantities OPS(*,X _1 ) and OPS(+,X _1 ) are the number of multiplications resp. 
additions in the matrix inversion. This total operations count can be but does not necessarily has to be the 
same as Eqn. (fT2l) . In the leading powers it is also 0(N 3 D 2 ). However, there are several reasons why on a 
computer there will be significant differences: The complexity model of counting the operations is inadequate 
for real computers. One has to consider the cache hierarchy and that a memory access has a latency and a 
bandwidth that falls behind the speed of the CPU. In the reverse mode of UTPS, many operations that could 
be computed as one instruction (due to the linearity in the linear algebra) are fetched from the memory. E.g. 
for the function f(X) = X~ l one has OPS(/) = ^(N 3 ) and therefore the memory requirement using UTPS 
is MEM(V/) = 0(N 3 ) but only MEM(V/) = &(N 2 ) when using UTPM. Also, when UTPS is applied to 
matrix algorithms, assumptions that were made to make those algorithms fast are no longer valid. E.g. the 
multiplication of two truncated Taylor polynomials is much more expensive than the addition. Also, due to 
the unknown degree D it is hard to write tuned algorithms as in ATLAS to avoid cache misses. Of particular 
importance is also the reduced memory requirement in the reverse mode of AD since using UTPM does not 
require to tape intermediate results that are used in the linear algebra functions. 
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1 1 EXPERIMENTAL PERFORMANCE COMPARISON 
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Figure 2: This Figure shows a runtime comparison between UTPM implemented using LAPACK and UTPS 
implemented with ADOL-C resp. UTPS implemented with Tapenade. In the left plot one can see that taping 
the function f(X) = tr(X _1 ) with ADOL-C is much slower than a function evaluation. The runtime explosion 
at N = 150 is a results from read/write access to the harddisk due to insufficient physical memory. It also 
shows that our implementation of the QR decomposition is about 5 to 10 times slower than LAPACK7 ATLAS. 
In the right plot one can see that the UTPS approach of both Tapenade and ADOL-C are much slower than 
UTPM, even if our non-optimal implementation is accounted for. 



11 Experimental Performance Comparison 



To compare the performance of UTPM to state-of-the-art approaches with UTPS we use an easy but suffi- 
ciently complex example for the case D = 1 has been implemented. The code is available at [SC09|. The goal 
is to compute the derivative V/ e R NxN of / : R NxN -> R 

X ^ f(X)=tr(X- 1 ). 

Since LAPACK code could not readily be differentiated with ADOL-C or Tapenade, we implemented the 
matrix inversion by QR decomposition using Givens rotations. This code was then taped with ADOL-C and 
differentiated in reverse mode. Taping refers to the process of recording the intermediate values v/ that are 
needed in the reverse mode of AD. Since Tapenade was not able to differentiate the C++ code necessary for 
ADOL-C we also implemented it in Fortran 77. The results are depicted and explained in Fig. [21 
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